Downloads & Proxy Management

Downloading files (PDF, images, ZIPs, …)

client.scrape(url, browser=False) handles binary responses natively since v0.6.0 — the response carries content_type, body_base64, and a requests.Response-style API (is_binary, body, text, save()). Same code path as a normal page scrape; the response shape tells you what came back.

resp = client.scrape("https://investors.example.com/charter.pdf", browser=False)
if resp.is_binary:
    resp.save("charter.pdf")              # one-liner write
    # data = resp.body                     # bytes (mirrors requests.Response.content)
    # ct = resp.content_type               # "application/pdf"
else:
    print(resp.content)                   # text response (markdown / html)

Use browser=False for direct file downloads. browser=True costs 5 credits and adds no value when the target is the file itself; it only helps when the file sits behind JavaScript, a viewer, or SPA navigation. See the Gotchas section below for that case.

resp.text returns None for binary responses (forces an explicit is_binary branch instead of silently parsing base64 as text). resp.body always returns bytes regardless of MIME — text responses are UTF-8-encoded for you.

With a proxy

Same kwarg as any scrape:

resp = client.scrape("https://example.com/file.pdf", browser=False, use_proxy="US")
resp.save("file.pdf")

Migrating from `download()`

client.download(url) is deprecated since v0.7.0 (still works, emits a DeprecationWarning; scheduled for removal in v1.0). The replacement is the same scrape(browser=False) call shown above — scrape returns binary content natively, so download no longer carries its weight.

# Before
result = client.download(url)
import base64
data = base64.b64decode(result.content)

# After
resp = client.scrape(url, browser=False)
resp.save("file.pdf")          # or: data = resp.body

Batch downloads (many files at once)

submit_batch + iter_results() streams binary responses just like any other batch. Every yielded ScrapeResponse carries the same is_binary / save / body surface — no separate API for downloads.

import logging

from scrapingpros import AsyncClient

log = logging.getLogger(__name__)

async def download_pdfs(urls, outdir):
    # `urls` maps doc_id -> url; `token` is your API token, defined elsewhere.
    items = [
        {"url": u, "custom_id": doc_id, "browser": False}
        for doc_id, u in urls.items()
    ]
    async with AsyncClient(token) as client:
        batch = await client.submit_batch("pdfs-daily", items)
        async for r in batch.iter_results():
            if not r.guidance.success:
                log.warning("failed %s: %s", r.url,
                            r.guidance.error_type)
                continue
            if r.is_binary:
                r.save(f"{outdir}/{r.custom_id}.pdf")
            else:
                # Server returned HTML instead of a file:
                # usually a redirect / 404 page / login wall.
                log.info("non-binary response from %s, skipping", r.url)

Memory stays constant — the streaming iterator never holds the full result list in RAM, so this scales to tens of thousands of files. Disk writes happen one at a time as each download completes.
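
The loop above builds each output path directly from custom_id. If those IDs come from an external source, it may be worth sanitizing them before writing to disk. A minimal sketch, assuming a hypothetical helper named safe_output_path (not part of the SDK) and a conservative allow-list of filename characters:

```python
import re
from pathlib import Path

# Hypothetical helper, not part of the SDK.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]+")

def safe_output_path(outdir: str, custom_id: str, ext: str = "pdf") -> Path:
    """Build outdir/<sanitized id>.<ext> so the id cannot escape outdir."""
    # Collapse every run of characters outside the allow-list into "_",
    # then drop leading/trailing dots and underscores (no "..", no dotfiles).
    name = _UNSAFE.sub("_", custom_id).strip("._")
    if not name:
        raise ValueError(f"custom_id {custom_id!r} has no usable characters")
    return Path(outdir) / f"{name}.{ext}"
```

In the batch loop this would replace the f-string, for example r.save(str(safe_output_path(outdir, r.custom_id))).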

If you need a list-return shape for simpler call sites:

results = client.batch_scrape([
    {"url": u, "custom_id": doc_id, "browser": False}
    for doc_id, u in urls.items()
])
for r in results:
    if r.guidance.success and r.is_binary:
        r.save(f"{outdir}/{r.custom_id}.pdf")

Crash-resilient batch downloads (since v0.7.5)

For pipelines that crash and restart, persist batch.last_completed_at alongside (collection_id, run_id) and resume from the cursor:

# Submit + persist as you go
batch = await client.submit_batch("pdfs-daily", items)
db.save(cid=batch.collection_id, rid=batch.run_id,
        submitted=batch.submitted_count)

async for r in batch.iter_results():
    if r.is_binary:
        r.save(f"{outdir}/{r.custom_id}.pdf")
    db.update(cid=batch.collection_id, cursor=batch.last_completed_at)

# After a restart, resume strictly after the saved cursor.
# `cid` is the collection_id persisted above; `db` is your own storage layer.
row = db.find(cid)
async for r in client.iter_results(row.cid, row.rid,
                                    since=row.cursor,
                                    submitted_count=row.submitted):
    if r.is_binary:
        r.save(f"{outdir}/{r.custom_id}.pdf")

See Batch API → Cross-process resume for the full pattern (including the same-millisecond ties caveat).
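
The db object in the snippet above is a placeholder for your own storage. One possible shape for it, as a sketch only: a small SQLite-backed store exposing the same save / update / find calls. The class name CursorStore, the table layout, and the assumption that IDs and the cursor can be stored as strings are all illustrative, not part of the SDK.

```python
import sqlite3
from collections import namedtuple

# Hypothetical stand-in for the `db` object used above; not part of the SDK.
Row = namedtuple("Row", "cid rid submitted cursor")

class CursorStore:
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS batches ("
            "cid TEXT PRIMARY KEY, rid TEXT NOT NULL, "
            "submitted INTEGER NOT NULL, last_cursor TEXT)"
        )

    def save(self, cid, rid, submitted):
        # Record the batch right after submit; the cursor starts empty.
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO batches "
                "(cid, rid, submitted, last_cursor) VALUES (?, ?, ?, NULL)",
                (cid, rid, submitted),
            )

    def update(self, cid, cursor):
        # Assumption: the cursor is stored as a string (convert it first if not).
        with self.conn:
            self.conn.execute(
                "UPDATE batches SET last_cursor = ? WHERE cid = ?",
                (cursor, cid),
            )

    def find(self, cid):
        row = self.conn.execute(
            "SELECT cid, rid, submitted, last_cursor FROM batches WHERE cid = ?",
            (cid,),
        ).fetchone()
        return Row(*row) if row else None
```

Passing a file path instead of ":memory:" is what makes the cursor survive a restart. Committing on every update keeps the saved cursor close to the last file written, at the cost of one small transaction per result.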

Large files (over the inline cap)

The jobs listing embeds bodies inline up to ~256 KB (server-side cap). Larger PDFs come back via the per-job result endpoint instead — the SDK does this automatically inside _build_result, so callers see the same ScrapeResponse shape with body_base64 / is_binary / save() populated. No special handling on your side. The only difference is one extra round-trip to fetch that body, which doesn't cost credits — only wire bandwidth.

Gotchas

JS-rendered PDF viewers

If the URL is a viewer wrapper (https://site.com/viewer.html?file=...), scrape(url, browser=False) returns the HTML of the viewer page, not the PDF. Render the wrapper with browser=True, extract the real PDF URL from the DOM, then run a second scrape with browser=False to download it:

viewer = client.scrape(viewer_url, browser=True, actions=[
    WaitForSelectorAction(selector="css:iframe[src*='.pdf']", time=8000),
])
real_pdf_url = extract_pdf_url(viewer.html)   # extract_pdf_url: a helper you write for the site's markup
pdf = client.scrape(real_pdf_url, browser=False)
pdf.save("doc.pdf")

This costs 5 credits for the render + 1 for the download = 6 total per file. Use it only when the direct path doesn't work.
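
The extract_pdf_url call in the snippet above is left to you, since the markup differs per site. A minimal sketch using only the standard library, assuming the viewer embeds the file through an iframe, embed, or object tag whose src or data attribute contains ".pdf". The optional base_url parameter is an addition for resolving relative links:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _PdfFrameFinder(HTMLParser):
    """Collects src/data attributes of embedding tags that mention '.pdf'."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in ("iframe", "embed", "object"):
            return
        for name, value in attrs:
            if name in ("src", "data") and value and ".pdf" in value.lower():
                self.candidates.append(value)

def extract_pdf_url(html, base_url=""):
    """Return the first embedded PDF URL, resolved against base_url, or None."""
    finder = _PdfFrameFinder()
    finder.feed(html)
    if not finder.candidates:
        return None
    return urljoin(base_url, finder.candidates[0])
```

Viewers that keep the file URL in a script variable or a query parameter need different extraction logic; treat this as a starting point rather than a general solution.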

Files behind auth / login walls

The SDK does not maintain a session between scrapes. If the PDF is behind a login form, you have two options. Either do the login as a separate MethodPOST(content_type="form") scrape, extract the resulting auth cookie or token from network_capture, and pass it on the file download; or model the whole flow as a single browser=True scrape with actions=[InputAction(...), ClickAction(...), WaitForSelectorAction(...)]. There is no one-size-fits-all recipe here; it depends on the site.

Verifying you got the file you expected

r.content_type carries the MIME the server returned ("application/pdf", "image/png", "application/zip", …). Branch on it when you process a mixed list of URLs and want to route by file type:

async for r in batch.iter_results():
    if not r.guidance.success or not r.is_binary:
        continue
    # Strip parameters such as "; charset=..." before comparing.
    mime = (r.content_type or "").split(";")[0].strip().lower()
    if mime == "application/pdf":
        r.save(f"pdfs/{r.custom_id}.pdf")
    elif mime.startswith("image/"):
        ext = mime.split("/")[1]
        r.save(f"images/{r.custom_id}.{ext}")
    else:
        r.save(f"misc/{r.custom_id}.bin")
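
The content type is whatever the server chose to send, so it can occasionally disagree with the bytes. If that matters for your pipeline, one option is to compare the declared type against well-known file signatures before saving. A sketch with hypothetical helper names (media_type and matches_declared_type are not SDK functions) and a deliberately small signature table:

```python
# Hypothetical helpers, not part of the SDK.
# Leading bytes ("magic numbers") for a few common formats.
_MAGIC = {
    "application/pdf": (b"%PDF-",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
    "image/jpeg": (b"\xff\xd8\xff",),
    "application/zip": (b"PK\x03\x04", b"PK\x05\x06"),
}

def media_type(content_type):
    """Normalize a Content-Type value: drop parameters, trim, lowercase."""
    if not content_type:
        return ""
    return content_type.split(";", 1)[0].strip().lower()

def matches_declared_type(content_type, data):
    """True/False when a signature is known for the type, None otherwise."""
    prefixes = _MAGIC.get(media_type(content_type))
    if prefixes is None:
        return None
    # bytes.startswith accepts a tuple of alternatives.
    return data.startswith(prefixes)
```

In the loop above, a check like matches_declared_type(r.content_type, r.body) is False could route the file to a quarantine folder instead of saving it under a misleading extension.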

Proxy management

List available countries

resp = client.list_proxy_countries()
print(resp.countries)  # ["AD", "AE", "AF", ...] — 200+ countries

Request a country proxy

Country-specific proxies require approval:

resp = client.request_proxy_country("US", reason="Need US pricing data")
print(resp.status)  # "pending" or "already_approved"

Check approval status

status = client.proxy_status()
print(status.approved_countries)  # ["US", "BR"]
print(status.pending_countries)   # ["DE"]

Use a country proxy

result = client.scrape("https://example.com", use_proxy="US")
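
The three calls above can be combined into a small guard that only hands back a country once it is approved. A sketch, assuming the response attributes shown above (approved_countries, pending_countries); the function name proxy_for is hypothetical:

```python
def proxy_for(client, country, reason):
    """Return `country` if it is approved; otherwise request it and return None."""
    status = client.proxy_status()
    if country in status.approved_countries:
        return country
    # Avoid re-submitting a request that is already waiting for approval.
    if country not in status.pending_countries:
        client.request_proxy_country(country, reason=reason)
    return None
```

When the function returns None, the caller decides whether to wait for approval or run the scrape without a country proxy.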

Plans and billing

# View all plans
plans = client.plans()

# Check current month billing
billing = client.billing()
print(billing.month)

# Usage metrics
metrics = client.client_metrics(date="2026-04")

# API health
health = client.health()
